server/heapprofiler: don't consider "goIdle" memory #37412
Conversation
This sounds reasonable, but I'd like to suck you into a little cleanup before this lands (I wrote most of the code in the comment below to make this easier to swallow - hope it works 😄)
The RSS size and the Go stats are taken separately because apparently at some point we had problems with the duration of the call to retrieve the Go stats:
cockroach/pkg/server/server.go
Lines 1916 to 1920 in 4850713
// We run two separate sampling loops, one for memory stats (via
// ReadMemStats) and one for all other runtime stats. This is necessary
// because as of go1.11, runtime.ReadMemStats() "stops the world" and
// requires waiting for any current GC run to finish. With a large heap, a
// single GC run may take longer than the default sampling period (10s).
The result is kind of crappy. We get the RSS in one loop, and the Go stats in the other, so they could be, at worst, 10s out of sync (or more if these stalls still happen).
What I think we should do is the following:
- unify the stats collection, i.e. move
cockroach/pkg/server/status/runtime.go
Line 409 in 034df0b
mem := gosigar.ProcMem{}
into SampleEnvironment, and store both that and the Go stats in a type MemStats struct { gosigar.ProcMem; runtime.MemStats } that has convenience functions to get "good numbers" out of it. Note that it makes sense to call gosigar last (i.e. after getting the Go mem stats), so that it's taken very soon after the Go stats computation returns.
- try to sync up these loops somewhat. I'm thinking the following
// As of go1.12, runtime.ReadMemStats() "stops the world" and requires
// waiting for any current GC run to finish. With a large heap, a single GC
// run may take longer than the default sampling period (10s). The code below
// is resilient to this sort of thing and will continue to run roughly at the
// defined frequency, using the last stats that were in fact collected.
s.stopper.RunWorker(ctx, func(ctx context.Context) {
timer := timeutil.NewTimer()
defer timer.Stop()
timer.Reset(frequency)
var memStats atomic.Value // *base.MemStats
var collecting int32 // atomic, 1 when stats call is ongoing
memStats.Store(&base.MemStats{})
for {
select {
case <-timer.C:
timer.Read = true
timer.Reset(frequency)
statsCollected := make(chan struct{})
if atomic.CompareAndSwapInt32(&collecting, 0, 1) {
if err := s.stopper.RunAsyncTask(ctx, "go-mem-stats", func(ctx context.Context) {
curStats := s.runtime.SampleMemStats(ctx)
memStats.Store(&curStats)
close(statsCollected)
atomic.StoreInt32(&collecting, 0)
}); err != nil {
close(statsCollected)
}
} else {
close(statsCollected)
}
select {
case <-time.After(time.Second):
// Maybe log a warning, but continue to use the old stats in
// this iteration.
case <-statsCollected:
// Common case, we have updated stats now.
}
curStats := memStats.Load().(*base.MemStats)
s.runtime.SampleEnvironment(ctx, curStats)
if goroutineDumper != nil {
goroutineDumper.MaybeDump(ctx, s.ClusterSettings(), s.runtime.Goroutines.Value())
}
if heapProfiler != nil {
heapProfiler.MaybeTakeProfile(ctx, s.ClusterSettings(), curStats)
}
case <-s.stopper.ShouldStop():
return
}
}
})
When we have that, we can think about what the meaningful thing for the heap profiler to look at is. I believe RSS is always somewhat inflated (at least on linux) due to the use of MADVISE. Also, the heap profiler never tells us anything about allocs on the non-Go side, so it doesn't make a whole lot of sense to use it for the heap profiler in the first place; the only thing that really matters is some combination of GoAllocated and GoIdle (but note that jemalloc can dump heap also, an extension which might make sense in the future but not if we switch to pebble).
HeapIdle
has this comment:
// HeapIdle minus HeapReleased estimates the amount of memory
// that could be returned to the OS, but is being retained by
// the runtime so it can grow the heap without requesting more
// memory from the OS. If this difference is significantly
// larger than the heap size, it indicates there was a recent
// transient spike in live heap size.
HeapIdle uint64
I agree that it's better to limit to the non-idle Go memory as a trigger. Lots of idle memory tells us that something interesting happened and we missed it, though maybe we did actually catch it earlier.
- remove fractionSystemMemoryHeuristic: we've established that we can't really reason about what a good threshold here is because we always expect sizeable chunks of memory on the C++ side
- introduce a high watermark heuristic: dump every time we see a new max for non-idle go heap size (or perhaps every time we see more than 1.3x the previous max, or something like that). Heap dumps are fairly cheap because they just write out info that is collected during GC.
Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, @Nishant9, and @tbg)
pkg/base/mem_stats.go, line 17 at r2 (raw file):
package base
// MemStats represents information from the Go allocator.
This seems like sort of a random struct that's kind of useful, but not really useful, and also not aptly named. See the suggestion on the main thread.
Please see now.
Reviewed 5 of 5 files at r3, 5 of 5 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, and @Nishant9)
pkg/base/mem_stats.go, line 1 at r4 (raw file):
// Copyright 2018 The Cockroach Authors.
19
pkg/server/server.go, line 1952 at r4 (raw file):
I'd clarify this to say
With a large heap and under extreme conditions, a single GC run may take longer than the default sampling period of 10s. Under normal operations and with more recent versions of Go, this hasn't been observed to be a problem.
pkg/server/heapprofiler/heapprofiler.go, line 94 at r4 (raw file):
}
// fractionSystemMemoryHeuristic is true if latest (RSS - goIdle) is more than
Thinking about this more, it still strikes me as silly that we're comparing apples (RSS) to oranges (goIdle) here. How about we always take a profile if GoActualHeapBytes() is larger than the max seen thus far? We don't have to worry about the min profile interval since we've resolved that taking a heap profile is cheap and because the profiler is already triggered periodically at a low enough frequency (though if you feel that we should retain some protection here, that's fine by me too but I would lower it to 10s).
pkg/server/heapprofiler/heapprofiler_test.go, line 52 at r4 (raw file):
assert.Nil(t, err)
if len(expectedScores) <= numProfiles {
    t.Fatalf("unexected profile: %d", numProfiles)
unexpected
pkg/server/status/runtime.go, line 502 at r4 (raw file):
humanize.IBytes(mem.Resident), numGoroutine, humanize.IBytes(ms.GoAllocated), humanize.IBytes(ms.GoIdle), humanize.IBytes(ms.GoTotal), staleMsg,
Nice touch.
pkg/server/status/runtime.go, line 550 at r4 (raw file):
goAllocated := ms.Alloc
goTotal := ms.Sys - ms.HeapReleased
rsr.GoAllocBytes.Update(int64(goAllocated))
Move this to SampleEnvironment
, then remove this method completely (we just call runtime.ReadMemStats
at the caller).
pkg/server/status/runtime.go, line 556 at r4 (raw file):
return base.GoMemStats{
    GoAllocated: goAllocated,
    GoIdle:      ms.HeapIdle - ms.HeapReleased,
Please just use runtime.MemStats
here (and use it in base.MemStats
) and do all the math later with an accessor. That way it'll be much easier to navigate later.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, @Nishant9, and @tbg)
pkg/server/heapprofiler/heapprofiler.go, line 94 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Thinking about this more, it still strikes me as silly that we're comparing apples (RSS) to oranges (goIdle) here. How about we always take a profile if GoActualHeapBytes() is larger than the max seen thus far? We don't have to worry about the min profile interval since we've resolved that taking a heap profile is cheap and because the profiler is already triggered periodically at a low enough frequency (though if you feel that we should retain some protection here, that's fine by me too but I would lower it to 10s).
well I thought it's silly to take profiles when the memory usage is "low", regardless of whether we've reached a high-water mark or not. And we don't really know what the target for all the go memory is, since so much memory is taken by CGo. So I want something that says "if there's plenty of free memory on the system, don't take a profile".
And also I don't know about only taking profiles on high-water marks... I think while memory is "high" it's better to take more profiles periodically.
So, I dunno... I'd still go with what I have here. And hopefully we'll switch to Pebble and then we can talk about just Go memory.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, and @Nishant9)
pkg/server/heapprofiler/heapprofiler.go, line 94 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
well I thought it's silly to take profiles when the memory usage is "low", regardless of whether we've reached a high-water mark or not. And we don't really know what the target for all the go memory is, since so much memory is taken by CGo. So I want something that says "if there's plenty of free memory on the system, don't take a profile".
And also I don't know about only taking profiles on high-water marks... I think while memory is "high" it's better to take more profiles periodically. So, I dunno... I'd still go with what I have here. And hopefully we'll switch to Pebble and then we can talk about just Go memory.
You're right that the first couple of profiles won't be useful, but it's not silly because it eliminates needing to know what a "reasonable" threshold for Go memory usage is, which is what has been the real silliness in this code thus far. Trying to fix that is nontrivial and we don't have much to win by even trying. Even when there's no C++ involved I'd prefer the robustness of the simple approach (proposed above). I'll also add that I originally asked for these complications to be introduced, which I regret now. It would've saved us a number of eng hours to just take a heap profile whenever a new high watermark is reached.
I can take this change over if you'd rather spend your time somewhere else.
Force-pushed b915432 to 9093c63.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, @Nishant9, and @tbg)
pkg/server/heapprofiler/heapprofiler.go, line 94 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You're right that the first couple of profiles won't be useful, but it's not silly because it eliminates needing to know what a "reasonable" threshold for Go memory usage is, which is what has been the real silliness in this code thus far. Trying to fix that is nontrivial and we don't have much to win by even trying. Even when there's no C++ involved I'd prefer the robustness of the simple approach (proposed above). I'll also add that I originally asked for these complications to be introduced, which I regret now. It would've saved us a number of eng hours to just take a heap profile whenever a new high watermark is reached.
I can take this change over if you'd rather spend your time somewhere else.
ok, no more RSS. Highwater-mark + periodic reset.
Did you see the other comments I had before your last push regarding GoMemStats? I feel rather strongly that they're worth addressing. I just spent 15 minutes trying to figure out what numbers we actually use now, which I wouldn't have had to do with my suggestion addressed. (I also didn't manage to convince myself, so let's address the comment and then I'll try again).
Reviewed 5 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, @Nishant9, and @tbg)
pkg/base/mem_stats.go, line 21 at r6 (raw file):
// GoMemStats represents information from the Go allocator.
// All units are bytes.
type GoMemStats struct {
I hate to bring this up again, but I find it hard to justify this over just embedding a runtime.MemStats in base.MemStats. What's the thinking there? Things have names that don't match runtime.MemStats too, which adds extra friction. I really think this just ought to be runtime.MemStats so that I can just navigate to the explanation from the stdlib instead of trusting that the numbers get propagated to the right fields and that the names match my expectations.
Since you're also holding on to Collected you'll probably want
type GoMemStats struct {
runtime.MemStats
Collected time.Time
}
with an accessor NotIdle() which boils down to whatever the final formula for it is (I tried piecing it together from this PR, but it's really not that straightforward and I also wasn't super confident that the result was the right thing, but let's discuss that when it's written down in one place here).
pkg/server/server.go, line 1952 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'd clarify this to say
With a large heap and under extreme conditions, a single GC run may take longer than the default sampling period of 10s. Under normal operations and with more recent versions of Go, this hasn't been observed to be a problem.
Ping
pkg/server/server.go, line 1947 at r6 (raw file):
statsCollected := make(chan struct{})
if atomic.CompareAndSwapInt32(&collectingMemStats, 0, 1) {
    if err := s.stopper.RunTask(ctx, "mem-stats-sample", func(ctx context.Context) {
RunAsyncTask, or the whole dance is useless :-)
pkg/server/heapprofiler/heapprofiler.go, line 47 at r6 (raw file):
)
type stats struct {
?
pkg/server/heapprofiler/heapprofiler.go, line 101 at r6 (raw file):
}
curMemUse := memStats.Go.GoTotal - memStats.Go.GoIdle
Put this in a method on memStats.Go, for example memStats.Go.HeapNotIdle(). You
pkg/server/status/runtime.go, line 550 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Move this to SampleEnvironment, then remove this method completely (we just call runtime.ReadMemStats at the caller).
Ping
Force-pushed 4af67c3 to d13fee8.
PTAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, @Nishant9, and @tbg)
pkg/base/mem_stats.go, line 1 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
19
Done.
pkg/base/mem_stats.go, line 21 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I hate to bring this up again, but I find it hard to justify this over just embedding a runtime.MemStats in base.MemStats. What's the thinking there? Things have names that don't match runtime.MemStats too, which adds extra friction. I really think this just ought to be runtime.MemStats so that I can just navigate to the explanation from the stdlib instead of trusting that the numbers get propagated to the right fields and that the names match my expectations.
Since you're also holding on to Collected you'll probably want
type GoMemStats struct { runtime.MemStats; Collected time.Time }
with an accessor NotIdle() which boils down to whatever the final formula for it is (I tried piecing it together from this PR, but it's really not that straightforward and I also wasn't super confident that the result was the right thing, but let's discuss that when it's written down in one place here).
ok, see now
pkg/server/server.go, line 1952 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ping
done
pkg/server/server.go, line 1947 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
RunAsyncTask, or the whole dance is useless :-)
indeed
pkg/server/heapprofiler/heapprofiler.go, line 47 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
?
removed
pkg/server/heapprofiler/heapprofiler.go, line 101 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Put this in a method on memStats.Go, for example memStats.Go.HeapNotIdle(). You
Done.
pkg/server/heapprofiler/heapprofiler_test.go, line 52 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
unexpected
n/a
pkg/server/status/runtime.go, line 502 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Nice touch.
Done.
pkg/server/status/runtime.go, line 550 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ping
done
pkg/server/status/runtime.go, line 556 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Please just use runtime.MemStats here (and use it in base.MemStats) and do all the math later with an accessor. That way it'll be much easier to navigate later.
Done.
Thanks for sticking it out! Please add release note about the removed cluster setting. One more substantial comment that's easy to address, one nit, and then we're good to go 🙏
Reviewed 2 of 7 files at r6, 5 of 5 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, @Nishant9, and @tbg)
pkg/base/mem_stats.go, line 25 at r7 (raw file):
// All units are bytes.
type GoMemStats struct {
    // MemStats is a pointer because it's a large struct. We want GoMemStats to be
This isn't anywhere close to any hot path and never will be, so this looks like premature optimization. Even if we cared, wouldn't you just embed the value anyway and deal in *GoMemStats elsewhere? You can runtime.ReadMemStats(&goMemStats.MemStats).
pkg/base/mem_stats.go, line 43 at r7 (raw file):
// is counted as "in use" by this method.
func (ms *GoMemStats) GoInUse() uint64 {
    // NOTE(andrei): It may seem weird that we're only subtracting HeapReleased
I think we're quite likely to shoot ourselves in the foot here in some way. HeapReleased is a counter (i.e. only ever goes up, or am I reading that wrong?) but Sys isn't. And even if they're actually comparable quantities, I don't think we can win anything here by trying to account for "as much Go memory as we can" here because the profile will show only the heap, and because the heuristic isn't properly exercised at scale (i.e. there's no test that puts a cluster under heap pressure at scale and verifies that we get sensible profile scores over extended periods of time and I don't think we're planning to write one). Let's use HeapAlloc, which is certifiably the thing we care about (and it shouldn't include HeapIdle, which was the original motivation for this PR):
// HeapAlloc is bytes of allocated heap objects.
//
// "Allocated" heap objects include all reachable objects, as
// well as unreachable objects that the garbage collector has
// not yet freed. Specifically, HeapAlloc increases as heap
// objects are allocated and decreases as the heap is swept
// and unreachable objects are freed. Sweeping occurs
// incrementally between GC cycles, so these two processes
// occur simultaneously, and as a result HeapAlloc tends to
// change smoothly (in contrast with the sawtooth that is
// typical of stop-the-world garbage collectors).
HeapAlloc uint64
There's a chance that we'll sometimes see crashes in which the heap remains bounded but some other Go memory (stacks when goroutine leaking perhaps?) explodes, but these are rarer and I expect them to show other signs that we can interpret. Either way, a heap profile in that case wouldn't show us much except that the heap wasn't huge. This information is better conveyed by logging the heap size in the runtime stats, which we do already:
cockroach/pkg/server/status/runtime.go
Line 552 in 034df0b
goAllocated := ms.Alloc
pkg/server/server.go, line 1968 at r7 (raw file):
// Good; we managed to read the Go memory stats quickly enough.
case <-time.After(time.Second):
    log.Infof(ctx, "get-mem-stats hasn't finished in one second; "+
Now that I'm looking at this, is this logging necessary? The runtime stats will print (stale), right? Your call.
pkg/server/heapprofiler/heapprofiler.go, line 56 at r7 (raw file):
//
// MaybeTakeProfile() is supposed to be called periodically. A profile is taken
// every time the current Go memory usage is the high-water mark. The recorded
s/memory/heap/ (in accordance with my comment about HeapAlloc)
pkg/server/status/runtime.go, line 389 at r7 (raw file):
// to keep runtime statistics current.
//
// SampleEnvironment takes GoMemStats as input because that it collected
s/it/is/
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @lucy-zhang, @Nishant9, and @tbg)
pkg/base/mem_stats.go, line 25 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't anywhere close to any hot path and never will be, so this looks like premature optimization. Even if we cared, wouldn't you just embed the value anyway and deal in *GoMemStats elsewhere? You can runtime.ReadMemStats(&goMemStats.MemStats).
Done.
pkg/base/mem_stats.go, line 43 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I think we're quite likely to shoot ourselves in the foot here in some way. HeapReleased is a counter (i.e. only ever goes up, or am I reading that wrong?) but Sys isn't. And even if they're actually comparable quantities, I don't think we can win anything here by trying to account for "as much Go memory as we can" here because the profile will show only the heap, and because the heuristic isn't properly exercised at scale (i.e. there's no test that puts a cluster under heap pressure at scale and verifies that we get sensible profile scores over extended periods of time and I don't think we're planning to write one). Let's use HeapAlloc, which is certifiably the thing we care about (and it shouldn't include HeapIdle, which was the original motivation for this PR):
// HeapAlloc is bytes of allocated heap objects.
//
// "Allocated" heap objects include all reachable objects, as
// well as unreachable objects that the garbage collector has
// not yet freed. Specifically, HeapAlloc increases as heap
// objects are allocated and decreases as the heap is swept
// and unreachable objects are freed. Sweeping occurs
// incrementally between GC cycles, so these two processes
// occur simultaneously, and as a result HeapAlloc tends to
// change smoothly (in contrast with the sawtooth that is
// typical of stop-the-world garbage collectors).
HeapAlloc uint64
There's a chance that we'll sometimes see crashes in which the heap remains bounded but some other Go memory (stacks when goroutine leaking perhaps?) explodes, but these are rarer and I expect them to show other signs that we can interpret. Either way, a heap profile in that case wouldn't show us much except that the heap wasn't huge. This information is better conveyed by logging the heap size in the runtime stats, which we do already:
cockroach/pkg/server/status/runtime.go
Line 552 in 034df0b
goAllocated := ms.Alloc
Sys is also a counter. That's spelled out in HeapSys, for example, which is one of its constituents; it says "as well as virtual address space for which the physical memory has been returned to the OS after it became unused".
The reason why I had written what I wrote was because I thought that the stacks are actually reflected somehow in the profile - I kinda remember seeing something about allocating stacks, but maybe I'm thinking of CPU profiles not memory profiles. In any case, I went with HeapAlloc
.
pkg/server/server.go, line 1968 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Now that I'm looking at this, is this logging necessary? The runtime stats will print (stale), right? Your call.
removed
pkg/server/heapprofiler/heapprofiler.go, line 56 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s/memory/heap/ (in accordance with my comment about HeapAlloc)
Done.
pkg/server/status/runtime.go, line 389 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s/it/is/
Done.
CI complains, I think it's because various tests now create errant profiles and we check for stray files at the end. Probably have to avoid writing profiles if the profiler is set up with an empty dir.
Reviewed 5 of 5 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @lucy-zhang, and @Nishant9)
pkg/server/heapprofiler/heapprofiler.go, line 56 at r8 (raw file):
Go heap allocated bytes exceeds the previous
pkg/server/heapprofiler/heapprofiler_test.go, line 38 at r8 (raw file):
{0, 30, true},   // we always take the first profile
{10, 40, true},  // new high-water mark
{20, 30, false}, // below high-water mark; no profile
Add a few here to test GC? Or is there a dedicated test elsewhere?
Release note: None
indeed. Now the profiler is disabled in unit tests. Curious if I need to disable in... any other tests. We'll see.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @lucy-zhang, @Nishant9, and @tbg)
pkg/server/heapprofiler/heapprofiler.go, line 56 at r8 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Go heap allocated bytes exceeds the previous
Done.
pkg/server/heapprofiler/heapprofiler_test.go, line 38 at r8 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add a few here to test GC? Or is there a dedicated test elsewhere?
well testing GC would need some more mocking. The profiler used in this test is configured to not actually write any files.
I'll leave it...
Reviewed 6 of 6 files at r9, 8 of 8 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @lucy-zhang, and @Nishant9)
Force-pushed 63d8aac to 608e691.
Before this patch we were taking a heap profile once RSS got above 85% of system memory and then we wouldn't take another profile until we go below it and then again above it. The problem with the RSS is that it includes "goIdle" memory (memory allocated from the OS that's not actually in use); I've seen it be up to several gigs.

With this patch, we now take a profile every time a new high-water mark is reached. The recorded high-water mark is reset once an hour, this way ensuring that one big measurement doesn't stop us forever from taking another profile. It's simpler and it has a chance of catching some further heap increases.

The rationale behind the patch is that we want profiles when the heap is large more than when the RSS is large. I'm looking at a case where we took a heap profile when the heap was 4.5 gigs with 2 gigs idle and then never took one again because of how the heuristic works. And then we OOMed when the heap was larger and the idle space was lower, but the RSS was about the same. With this patch, we would have taken a profile at a more interesting time.

Release note: None
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @lucy-zhang, @Nishant9, and @tbg)
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @lucy-zhang, @Nishant9, and @tbg)
bors pls
borsy boy
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @lucy-zhang, @Nishant9, and @tbg)
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @lucy-zhang, @Nishant9, and @tbg)
36101: roachprod: Minor readme updates r=bobvawter a=bdarnell

Use new convenience commands and demonstrate running a workload

Release note: None

37282: exec: add templating for RANK and ROW_NUMBER window functions r=yuzefovich a=yuzefovich

Templates out rankOp into two operators (for dense and non-dense case) and templates out support for PARTITION BY clause for both rankOp and rowNumberOp. The code is undertested and is lacking benchmarks, but that will be addressed soon.

37412: server/heapprofiler: don't consider "goIdle" memory r=andreimatei a=andreimatei

Before this patch we were taking a heap profile once RSS got above 85% of system memory and then we wouldn't take another profile until we go below it and then again above it. The problem with the RSS is that it includes "goIdle" memory (memory allocated from the OS that's not actually in use); I've seen it be up to several gigs.

With this patch, we now take a profile every time a new high-water mark is reached. The recorded high-water mark is reset once an hour, this way ensuring that one big measurement doesn't stop us forever from taking another profile. It's simpler and it has a chance of catching some further heap increases.

The rationale behind the patch is that we want profiles when the heap is large more than when the RSS is large. I'm looking at a case where we took a heap profile when the heap was 4.5 gigs with 2 gigs idle and then never took one again because of how the heuristic works. And then we OOMed when the heap was larger and the idle space was lower, but the RSS was about the same. With this patch, we would have taken a profile at a more interesting time.

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Build succeeded